top-k operator
Differentiable Fast Top-K Selection for Large-Scale Recommendation
Zhu, Yanjie, Zhang, Zhen, Wang, Yunli, Wang, Zhiqiang, Li, Yu, Zhou, Rufan, Wen, Shiyang, Jiang, Peng, Lin, Chenhao, Yang, Jian
Cascade ranking is a widely adopted paradigm in large-scale information retrieval systems for Top-K item selection. However, the Top-K operator is non-differentiable, hindering end-to-end training. Existing methods include Learning-to-Rank approaches (e.g., LambdaLoss), which optimize ranking metrics like NDCG and suffer from objective misalignment, and differentiable sorting-based methods (e.g., ARF, LCRON), which relax permutation matrices for direct Top-K optimization but introduce gradient conflicts through matrix aggregation. A promising alternative is to directly construct a differentiable approximation of the Top-K selection operator, bypassing soft permutation matrices altogether. However, even state-of-the-art differentiable Top-K operators (e.g., LapSum) require $O(n \log n)$ time because they rely on sorting to solve for the selection threshold. We therefore propose DFTopK, a novel differentiable Top-K operator with optimal $O(n)$ time complexity. By relaxing the normalization constraints, DFTopK admits a closed-form solution and avoids sorting; it also sidesteps the gradient conflicts inherent in differentiable sorting-based methods. We evaluate DFTopK on the public RecFlow benchmark and in an industrial system. Experimental results show that DFTopK significantly improves training efficiency while achieving superior performance, which allows us to scale up training samples more efficiently. In an online A/B test, DFTopK yielded a +1.77% revenue lift under the same computational budget compared with the baseline. To the best of our knowledge, this work is the first to introduce differentiable Top-K operators into recommendation systems and the first to achieve theoretically optimal linear-time complexity for Top-K selection. We have open-sourced our implementation to facilitate future research in both academia and industry.
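DFTopK's sorting-free closed-form construction is not reproduced in the abstract, so the following is only a minimal sketch of the general idea of a differentiable Top-K relaxation: replace the hard 0/1 selection with a sigmoid mask around a threshold placed between the k-th and (k+1)-th largest scores. The threshold placement via torch.topk, the temperature, and the function name are illustrative assumptions, not the paper's operator.

```python
import torch

def soft_topk_mask(scores: torch.Tensor, k: int, temperature: float = 0.1) -> torch.Tensor:
    """Illustrative sigmoid relaxation of a Top-K indicator (not DFTopK itself).

    The hard Top-K mask is 1 for the k largest scores and 0 elsewhere; here the
    threshold sits halfway between the k-th and (k+1)-th largest scores and the
    margins are squashed with a sigmoid, so gradients reach every score.
    """
    n = scores.shape[-1]
    assert 0 < k < n
    top_vals = torch.topk(scores, k + 1, dim=-1).values   # note: relies on partial sorting
    tau = (top_vals[..., k - 1] + top_vals[..., k]) / 2.0
    return torch.sigmoid((scores - tau.unsqueeze(-1)) / temperature)

# The relaxed mask sums approximately to k and is differentiable end to end.
x = torch.randn(8, requires_grad=True)
mask = soft_topk_mask(x, k=3)
mask.sum().backward()
```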
- North America > United States > District of Columbia > Washington (0.05)
- Asia > China > Beijing > Beijing (0.05)
- Asia > China > Shaanxi Province > Xi'an (0.04)
- Asia > China > Shandong Province > Dongying (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- North America > Canada (0.04)
A Related Work on Compression Techniques in Distributed Optimization and Learning
The analysis in this section follows the techniques introduced in [8]. The proof of Proposition 2 follows roughly the same steps as the proof of Proposition 1. In this section, we compile some results that will prove useful later in our analysis. With this in mind, we assume throughout this section that all clients perform the same number of local updates. To that end, we make use of the following lemma. To prove Theorem 5, we construct an example involving two clients.
Review for NeurIPS paper: Differentiable Top-k with Optimal Transport
Additional Feedback: Some comments: l. 117 The entropic OT is surely not more computationally friendly than a top-k operator that simply sorts the vector. The same holds for the beam-search method. The present work seems to be a sequence of ad-hoc definitions rather than a principled objective; in particular, it is important to make the optimization objective clear to enable future comparisons. Can the authors clearly distinguish their contributions from those of Cuturi et al., 2019? It seems that the authors' implementation is potentially faster, which should be highlighted.
MoDification: Mixture of Depths Made Easy
Zhang, Chen, Zhong, Meizhi, Wang, Qimeng, Lu, Xuantao, Ye, Zheyu, Lu, Chengqiang, Gao, Yan, Hu, Yao, Chen, Kehai, Zhang, Min, Song, Dawei
Long-context efficiency has recently become a trending topic in serving large language models (LLMs), and mixture of depths (MoD) has been proposed as a perfect fit to bring down both latency and memory. In this paper, however, we discover that MoD can barely transform existing LLMs without costly training over an extensive number of tokens. To enable transformations from any LLM to a MoD one, we show that the top-k operator in MoD should be promoted to a threshold-p operator, and that refinements to the architecture and data should be crafted accordingly. All these designs form our method, termed MoDification. Through a comprehensive set of experiments covering model scales from 3B to 70B, we show that MoDification strikes an excellent balance between efficiency and effectiveness. MoDification can achieve up to ~1.2x speedup in latency and ~1.8x reduction in memory compared to original LLMs, especially in long-context applications.
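As a rough illustration of the promotion the abstract describes, the two routing rules can be contrasted as follows; the router logits, the sigmoid gating, and the threshold value p = 0.5 are assumptions made for this sketch, not details taken from the paper.

```python
import torch

def route_topk(router_logits: torch.Tensor, k: int) -> torch.Tensor:
    """Capacity-style MoD routing: the k highest-scoring tokens per sequence
    pass through the block (mask = 1); the rest skip it."""
    idx = torch.topk(router_logits, k, dim=-1).indices
    mask = torch.zeros_like(router_logits)
    mask.scatter_(-1, idx, 1.0)
    return mask

def route_threshold_p(router_logits: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Threshold-p routing: a token passes whenever its routing probability
    exceeds p, so the number of processed tokens varies per sequence."""
    return (torch.sigmoid(router_logits) > p).float()

logits = torch.randn(2, 6)                 # 2 sequences, 6 tokens each (toy example)
print(route_topk(logits, k=2))             # exactly 2 tokens per sequence
print(route_threshold_p(logits, p=0.5))    # data-dependent number of tokens
```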
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Austria > Vienna (0.14)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.05)
Beam Tree Recursive Cells
Chowdhury, Jishnu Ray, Caragea, Cornelia
We propose Beam Tree Recursive Cell (BT-Cell), a backpropagation-friendly framework to extend Recursive Neural Networks (RvNNs) with beam search for latent structure induction. We further extend this framework by proposing a relaxation of the hard top-k operators in beam search for better propagation of gradient signals. We evaluate our proposed models on different out-of-distribution splits of both synthetic and realistic data. Our experiments show that BT-Cell achieves near-perfect performance on several challenging structure-sensitive synthetic tasks like ListOps and logical inference while maintaining comparable performance on realistic data against other RvNN-based models. Additionally, we identify a previously unknown failure case for neural models in generalization to an unseen number of arguments in ListOps. The code is available at: https://github.com/JRC1995/BeamTreeRecursiveCells.
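BT-Cell's exact relaxation is not spelled out in the abstract; as a generic illustration of how a hard top-k selection can be made backpropagation-friendly, here is a straight-through-style estimator that keeps the exact selection on the forward pass while routing gradients through a sigmoid-relaxed mask. The temperature and the straight-through trick are assumptions of this sketch, not the paper's method.

```python
import torch

def straight_through_topk(scores: torch.Tensor, k: int, temperature: float = 0.1) -> torch.Tensor:
    """Hard top-k mask on the forward pass, soft gradients on the backward pass."""
    topk = torch.topk(scores, k, dim=-1)
    hard = torch.zeros_like(scores).scatter_(-1, topk.indices, 1.0)
    tau = topk.values[..., -1:].detach()                     # k-th largest score
    soft = torch.sigmoid((scores - tau) / temperature)       # relaxed mask
    return soft + (hard - soft).detach()                     # forward = hard, grad = soft

beam_scores = torch.randn(5, requires_grad=True)
mask = straight_through_topk(beam_scores, k=2)
(mask * beam_scores).sum().backward()                        # gradients reach all candidates
```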
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Fast, Differentiable and Sparse Top-k: a Convex Analysis Perspective
Sander, Michael E., Puigcerver, Joan, Djolonga, Josip, Peyré, Gabriel, Blondel, Mathieu
The top-k operator returns a sparse vector, where the non-zero values correspond to the k largest values of the input. Unfortunately, because it is a discontinuous function, it is difficult to incorporate in neural networks trained end-to-end with backpropagation. Recent works have considered differentiable relaxations, based either on regularization or perturbation techniques. However, to date, no approach is fully differentiable and sparse. In this paper, we propose new differentiable and sparse top-k operators. We view the top-k operator as a linear program over the permutahedron, the convex hull of permutations. We then introduce a p-norm regularization term to smooth out the operator, and show that its computation can be reduced to isotonic optimization. Our framework is significantly more general than the existing one and allows us, for example, to express top-k operators that select values in magnitude. On the algorithmic side, in addition to pool adjacent violators (PAV) algorithms, we propose a new GPU/TPU-friendly Dykstra algorithm to solve isotonic optimization problems. We successfully use our operators to prune weights in neural networks, to fine-tune vision transformers, and as a router in sparse mixture of experts.
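The reduction to isotonic optimization is the computational core of the paper. For reference, a compact pool-adjacent-violators (PAV) solver for the plain L2 isotonic regression subproblem is sketched below; the p-norm-regularized operators and the GPU/TPU-friendly Dykstra variant are not reproduced here.

```python
import numpy as np

def isotonic_l2(y: np.ndarray) -> np.ndarray:
    """PAV for min_v ||v - y||^2 subject to v[0] <= v[1] <= ... <= v[n-1]."""
    means, sizes = [], []
    for value in y:
        means.append(float(value))
        sizes.append(1)
        # Merge adjacent blocks while the monotonicity constraint is violated.
        while len(means) > 1 and means[-2] > means[-1]:
            m, s = means.pop(), sizes.pop()
            means[-1] = (means[-1] * sizes[-1] + m * s) / (sizes[-1] + s)
            sizes[-1] += s
    out = np.empty(len(y))
    pos = 0
    for m, s in zip(means, sizes):
        out[pos:pos + s] = m
        pos += s
    return out

print(isotonic_l2(np.array([3.0, 1.0, 2.0, 5.0, 4.0])))   # [2.  2.  2.  4.5 4.5]
```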
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
Stochastic smoothing of the top-K calibrated hinge loss for deep imbalanced classification
Garcin, Camille, Servajean, Maximilien, Joly, Alexis, Salmon, Joseph
In modern classification tasks, the number of labels is getting larger and larger, as is the size of the datasets encountered in practice. As the number of classes increases, class ambiguity and class imbalance become more and more problematic for achieving high top-1 accuracy. Meanwhile, Top-K metrics (metrics allowing K guesses) have become popular, especially for performance reporting. Yet, proposing top-K losses tailored for deep learning remains a challenge, both theoretically and practically. In this paper we introduce a stochastic top-K hinge loss inspired by recent developments on top-K calibrated losses. Our proposal is based on smoothing the top-K operator, building on the flexible "perturbed optimizer" framework. We show that our loss function performs very well on balanced datasets, while benefiting from a significantly lower computational time than the state-of-the-art top-K loss function. In addition, we propose a simple variant of our loss for the imbalanced case. Experiments on a heavy-tailed dataset show that our loss function significantly outperforms other baseline loss functions.
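The "perturbed optimizer" smoothing can be sketched as follows: add Gaussian noise to the scores, take the hard top-K indicator of each noisy copy, and average. The noise scale and sample count below are illustrative, and the sketch omits the gradient estimator the actual loss would attach to this expectation.

```python
import torch

def perturbed_topk_indicator(scores: torch.Tensor, k: int,
                             sigma: float = 0.5, n_samples: int = 256) -> torch.Tensor:
    """Monte Carlo estimate of the smoothed indicator E[top-K(scores + sigma * Z)]."""
    noise = sigma * torch.randn(n_samples, *scores.shape)
    perturbed = scores.unsqueeze(0) + noise                          # (S, n)
    idx = torch.topk(perturbed, k, dim=-1).indices
    hard = torch.zeros_like(perturbed).scatter_(-1, idx, 1.0)        # hard masks per sample
    return hard.mean(dim=0)                                          # smooth in expectation

scores = torch.tensor([2.0, 1.9, 0.5, -1.0])
print(perturbed_topk_indicator(scores, k=2))   # near-tied scores receive fractional mass
```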
- North America > Canada > Ontario > Toronto (0.14)
- Europe > France > Occitanie > Hérault > Montpellier (0.05)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Differentiable Top-k Operator with Optimal Transport
Xie, Yujia, Dai, Hanjun, Chen, Minshuo, Dai, Bo, Zhao, Tuo, Zha, Hongyuan, Wei, Wei, Pfister, Tomas
The top-k operation, i.e., finding the k largest or smallest elements from a collection of scores, is an important model component that is widely used in information retrieval, machine learning, and data mining. However, if the top-k operation is implemented in an algorithmic way, e.g., using the bubble sort algorithm, the resulting model cannot be trained in an end-to-end way using prevalent gradient descent algorithms. This is because these implementations typically involve swapping indices, whose gradient cannot be computed. Moreover, the corresponding mapping from the input scores to the indicator vector of whether an element belongs to the top-k set is essentially discontinuous. To address this issue, we propose a smoothed approximation, namely the SOFT (Scalable Optimal transport-based diFferenTiable) top-k operator. Specifically, our SOFT top-k operator approximates the output of the top-k operation as the solution of an Entropic Optimal Transport (EOT) problem. The gradient of the SOFT operator can then be efficiently approximated based on the optimality conditions of the EOT problem. We apply the proposed operator to the k-nearest neighbors and beam search algorithms, and demonstrate improved performance.
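Abstracting away the paper's exact parameterization, the EOT view can be sketched as a tiny Sinkhorn problem: transport n scores (mass 1/n each) to two anchors, a "selected" anchor that must receive total mass k/n and a "not selected" anchor with mass (n-k)/n; the mass each score sends to the selected anchor, rescaled by n, acts as a soft top-k indicator. The anchor positions, squared cost, regularization strength, and iteration count are assumptions of this sketch.

```python
import torch

def sinkhorn_soft_topk(scores: torch.Tensor, k: int,
                       eps: float = 0.05, n_iters: int = 200) -> torch.Tensor:
    """Soft top-k indicator from a two-column entropic OT problem (illustrative)."""
    n = scores.numel()
    x = (scores - scores.min()) / (scores.max() - scores.min() + 1e-9)   # rescale to [0, 1]
    anchors = torch.tensor([0.0, 1.0])                                   # not selected / selected
    cost = (x.unsqueeze(1) - anchors.unsqueeze(0)) ** 2                  # (n, 2) squared cost
    a = torch.full((n,), 1.0 / n)                                        # source marginal
    b = torch.tensor([(n - k) / n, k / n])                               # target marginal
    K = torch.exp(-cost / eps)
    u, v = torch.ones(n), torch.ones(2)
    for _ in range(n_iters):                                             # Sinkhorn iterations
        u = a / (K @ v)
        v = b / (K.t() @ u)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)                           # transport plan
    return n * plan[:, 1]                                                # soft indicator in [0, 1]

s = torch.tensor([0.3, 2.0, -1.0, 1.8, 0.1])
print(sinkhorn_soft_topk(s, k=2))    # largest values for the two highest scores
```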
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.69)